{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Exploration\n",
    "\n",
    "In this notebook, we'll take the opportunity to explore the data we will be using.  Data exploration and understanding is a fundamental step in Machine Learning.\n",
    "\n",
    "In this learning path, the data we'll be looking at is the Diabetes data set.  It consists of ~800 rows of medical data (e.g. blood pressure, heart rate etc...) for female patients with and without diabetes.  It is worth mentioning that in ML, 800 examples is considered a very very small dataset.  We'll be creating a model that will help us determine which of our patients, based on their medical data, may have diabetes.\n",
    "\n",
    "It is helpful to know a few characteristics about the dataset in this lab.  Specifically how it is laid out.  In the Diabetes dataset we have provided, we have done some preprocessing of the data.  The data you will be provided with is in CSV format.  There are no patient names provided in this dataset.\n",
    "\n",
    "This dataset will be used to train the model to recognize which medical data (e.g. blood pressure, blood values) are common in patients who have diabetes.\n",
    "\n",
    "At the end of this notebook, we'll have a better understanding of what our data looks like, and that will help us when constructing the AI in the future notebook.\n",
    "\n",
    "To start, let's load the data!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load diabetes dataset into a dataframe called 'df'\n",
    "df = pd.read_csv(\"../data/diabetes.csv\")\n",
    "\n",
    "#Obtain the length (rows) and shape (columns) of our dataset\n",
    "len(df)\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that there are 769 rows and 9 columns.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Analysis\n",
    "Let's analyze our data closer using some standard data analysis methods available in python.  A useful method is info() which will show us the column names and data types in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the column names and associated data types.  Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age and Outcome are of data type integer (e.g. 34, 2, 98).\n",
    "\n",
    "BMI and Diabetes PedigreeFuntion are of data type float (e.g. 1.3, 4.56)\n",
    "\n",
    "Let's use the describe() function to learn statistical information (e.g. mean, standard deviation, min, max...) for each of our columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that there are 769 entries in our dataset.  There are 9 columns and they are all non-null."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.\n",
    "\n",
    "* Pregnancies    - Number of times pregnant\n",
    "* Glucose        - Plasma glucose concentration a 2 hours in an oral glucose tolerance test\n",
    "* Blood Pressure - Diastolic blood pressure (mm Hg)\n",
    "* SkinThickness  - Triceps skin fold thickness (mm)\n",
    "* Insulin        - 2-Hour serum insulin (mu U/ml)\n",
    "* BMI            - Body mass index (weight in kg/(height in m)^2)\n",
    "* DiabetesPedigreeFunction - Diabetes pedigree function\n",
    "* Age            - Age (years)\n",
    "* Outcome        - Class variable (0 or 1) 268 of 768 are 1, the others are 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Let's now look at our data using a series of histograms and plots in order to see how 'distributed' our data appears.  We can do this using the hist() function to visually see how our data is distributed for each column in our data set.  This will allow us to determine if any of the columns have 'outliers'.  \n",
    "\n",
    "Outliers are data points that differ significantly from other data points.  They can be due to a variability in measurement or may even indicate a measurement error. It is up to the Data Engineer (or Data Scientist) to determine if the outlier(s) are considered abnormal and should be removed or kept within the dataset. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#plot histograms of the columns on multiple subplots\n",
    "plt.close('all')\n",
    "df.diff().hist(color=\"k\", alpha=0.5, bins=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above histograms we can see that our dataset is evenly distributed except for some outliers in BMI, Outcomes and DiabetesPedigreeFunction.  For now we will keep the outliers in our dataset.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a closer look at the actual values in our dataset.  A method we can use to examine the first 20 rows of data (plus the column names) is the head() function. Using df.head() we can see the column names of our dataframe:  Pregnancies, Glucose, BloodPressure, Skinthickness, Insulin, BMI, \n",
    "DiabetespedigreeFunction, Age and Outcome"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look closely we see that there are '0' values found for Insulin levels and skin thickness.  The 0 values indicate that we have missing or null data.  Let's check the last 20 rows of our dataset and determine if there are any '0' values present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let's examine the last 20 rows of our dataframe\n",
    "df.tail(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the last 20 rows of our dataset, there are also '0' values present for Insulin and SkinThickness.  Therefore, it is likely that there are more '0' values present throughout our dataset.\n",
    "\n",
    "Let's determine how many '0' values are present in our dataset.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Determine the number of '0' values\n",
    "df2 = df.iloc[:, :-1]\n",
    "\n",
    "#number & percent of '0's for each attribute\n",
    "numZero = (df2[:] == 0).sum()\n",
    "perZero = ((df2[:] == 0).sum())/768*100\n",
    "\n",
    "print(\"\\nRows, Columns: \",df2.shape)\n",
    "print(\"\\nNumber of 0's:\")\n",
    "print(numZero)\n",
    "print(\"\\nPercentage of 0's:\")\n",
    "print(perZero)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "There are 227 Zero values for SkinThickness and 374 Zero values (or ~50%) for Insulin.  If we want to build and train a model we need to address these '0' values.  We need to ask ourselves if the missing data values are in some way related (correlated) to each other.  Let's look for correlations in our dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Looking for Correlations\n",
    "\n",
    "Because our dataset is small we can compute the 'standard correlation coefficient' between every pair of attributes using corr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corrM = df.corr()\n",
    "corrM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the coefficient is close to 1 it means there is a strong positive correlation.  When we take a look at our data we don't see any attributes that are highly correleated, even our 0 value attributes Insulin & SkinThickness only have a correlation value of 0.436783.  Even though there is not high correleation, the zero values are still eraneous and we shouldn't include them in our model.  Is this a common problem faced?  And how is it solved?  \n",
    "\n",
    "This is a common problem.  And when dealing with missing data, data scientists can use two primary methods to solve the error: imputation or the removal of data. Imputation develops reasonable guesses for missing data such as replacing these values with a median measurement.  But what we don't want to do this before we break the dataset into training and testing sets for our model.  Let's leave this issue until then. In the meantime, since we are fairly happy with our Data Analysis, we can go to the next step which is model development.\n",
    "\n",
    "Open notebook 02-model-development.ipynb to continue.\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
